Graph-Based Feature Augmentation for Predictive Tasks on Relational Datasets

Qiao, Lianpeng, Cao, Ziqi, Feng, Kaiyu, Yuan, Ye, Wang, Guoren

arXiv.org Artificial Intelligence

Data has become a foundational asset driving innovation across domains such as finance, healthcare, and e-commerce. In these areas, predictive modeling over relational tables is commonly employed, with increasing emphasis on reducing manual effort through automated machine learning (AutoML) techniques. This raises an interesting question: can feature augmentation itself be automated to identify and utilize task-related relational signals? To address this challenge, we propose an end-to-end automated feature augmentation framework, ReCoGNN, which enhances initial datasets using features extracted from multiple relational tables to support predictive tasks. ReCoGNN first captures semantic dependencies within each table by modeling intra-table attribute relationships, enabling it to partition tables into structured, semantically coherent segments. It then constructs a heterogeneous weighted graph that represents inter-row relationships across all segments. Finally, ReCoGNN leverages message-passing graph neural networks to propagate information through the graph, guiding feature selection and augmenting the original dataset. Extensive experiments conducted on ten real-life and synthetic datasets demonstrate that ReCoGNN consistently outperforms existing methods on both classification and regression tasks.
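The message passing over a heterogeneous weighted graph described in the abstract can be illustrated with a minimal, dependency-free sketch. This is a generic weighted-neighbor aggregation step, not the authors' ReCoGNN implementation; the function name `message_pass` and the data layout are our own assumptions.

```python
def message_pass(features, edges):
    """One message-passing step: each node's new feature vector is the
    edge-weight-normalized mean of its neighbors' feature vectors.

    features: {node: [float, ...]}
    edges:    {node: [(neighbor, weight), ...]}
    """
    new_features = {}
    for node, feat in features.items():
        neighbors = edges.get(node, [])
        if not neighbors:
            # Isolated nodes keep their current features.
            new_features[node] = list(feat)
            continue
        total_weight = sum(w for _, w in neighbors)
        aggregated = [0.0] * len(feat)
        for neighbor, w in neighbors:
            for i, value in enumerate(features[neighbor]):
                aggregated[i] += w * value
        new_features[node] = [v / total_weight for v in aggregated]
    return new_features

feats = {"a": [1.0], "b": [3.0], "c": [5.0]}
edges = {"a": [("b", 1.0), ("c", 1.0)]}
out = message_pass(feats, edges)
# node "a" aggregates its neighbors b and c: (3 + 5) / 2 = 4.0
```

In a GNN, this aggregation would be interleaved with learned transformations and repeated for several rounds so information propagates beyond immediate neighbors.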


FeatNavigator: Automatic Feature Augmentation on Tabular Data

Liang, Jiaming, Lei, Chuan, Qin, Xiao, Zhang, Jiani, Katsifodimos, Asterios, Faloutsos, Christos, Rangwala, Huzefa

arXiv.org Artificial Intelligence

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.


Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

Cappuzzo, Riccardo, Varoquaux, Gael, Coelho, Aimee, Papotti, Paolo

arXiv.org Artificial Intelligence

We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates and the efficiency of simple merging methods. We report new insights on the benefits of existing solutions and on their limitations, aiming at guiding future research in this space.
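The "retrieve joinable tables, then merge" step the abstract describes can be sketched with a toy example; the city/population tables below are invented for illustration and are not from the paper's benchmarks.

```python
import pandas as pd

# Base table carrying the prediction target.
base = pd.DataFrame({"city": ["Paris", "Lyon"], "y": [1, 0]})

# A candidate table retrieved from the data lake, joinable on "city".
candidate = pd.DataFrame({"city": ["Paris", "Lyon"],
                          "population": [2_100_000, 520_000]})

# A left join is one of the simple merging methods the study evaluates:
# it keeps every base-table row even when a candidate row is missing.
augmented = base.merge(candidate, on="city", how="left")
```

The prediction step then trains a model on `augmented` and compares its performance against a model trained on `base` alone.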


Optimize your Amazon Redshift query performance with automated materialized views - Channel969

#artificialintelligence

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Amazon Redshift lets you analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS-designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price-performance at scale. Although Amazon Redshift provides excellent price-performance out of the box, it offers additional optimizations that can improve this performance and help you achieve even faster query response times from your data warehouse. For example, you can physically tune tables in a data model to minimize the amount of data scanned and distributed within a cluster, which speeds up operations such as table joins and range-bound scans. Amazon Redshift now automates this tuning with the automatic table optimization (ATO) feature.


How to Split and Sample a Dataset in BigQuery Using SQL

#artificialintelligence

Splitting data means that we will divide it into subsets. For data science models, datasets are usually partitioned into two or three subsets: training, validation, and test. Each subset of data has a purpose, from creating a model to ensuring its performance. To decide on the size of each subset, we often see standard rules and ratios. There have been some discussions about what an optimal split might be, but in general, I would recommend keeping in mind that not having enough data in either the training or validation set will result in a model that is difficult to train, or will make it hard to determine whether the model actually performs well. It's worth noting that you don't always have to make three segments.
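A common BigQuery pattern for splitting is to hash a stable row identifier into buckets so the assignment is repeatable across runs. The sketch below mimics that idea in Python, using `hashlib.md5` in place of BigQuery's `FARM_FINGERPRINT`; the 80/10/10 ratio is one common choice, not a rule prescribed by the article.

```python
import hashlib

def assign_split(row_id: str, train_pct: int = 80, val_pct: int = 10) -> str:
    """Deterministically assign a row to train/validation/test by hashing
    its ID into one of 100 buckets, mirroring the BigQuery pattern of
    bucketing on ABS(MOD(FARM_FINGERPRINT(id), 100))."""
    bucket = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "validation"
    return "test"

rows = [f"row-{i}" for i in range(1000)]
splits = [assign_split(r) for r in rows]
```

Because the split depends only on the row ID, re-running the query (or adding new rows) never moves an existing row between subsets, which keeps evaluation honest.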


Predict House Prices with Machine Learning

#artificialintelligence

A lot of feature engineering rests on domain expertise. If you have a subject matter expert (SME) on real estate to provide guidance, you'll have a better chance of engineering some awesome feature that will really make your modelling shine. The following code creates these new features. The new property_age feature arguably supersedes the original tx_year and year_built, so we'll remove them. Analytical base table: the dataset obtained after applying all of these data cleaning and feature engineering steps is our analytical base table.
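The derived feature described above can be sketched in a few lines of pandas, assuming a DataFrame with the article's `tx_year` and `year_built` columns (the sample values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "tx_year":    [2004, 2010, 1998],
    "year_built": [1985, 2005, 1998],
    "price":      [250_000, 310_000, 180_000],
})

# property_age at transaction time captures what the two raw year
# columns jointly encode, so we drop them after deriving it.
df["property_age"] = df["tx_year"] - df["year_built"]
df = df.drop(columns=["tx_year", "year_built"])
```

Collapsing two correlated raw columns into one interpretable feature also reduces redundancy in the analytical base table.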


ARDA: Automatic Relational Data Augmentation for Machine Learning

Chepurko, Nadiia, Marcus, Ryan, Zgraggen, Emanuel, Fernandez, Raul Castro, Kraska, Tim, Karger, David

arXiv.org Machine Learning

Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.